6 research outputs found

    Optimizing the Performance of Multi-threaded Linear Algebra Libraries Based on Task Granularity

    Linear algebra libraries play a very important role in many HPC applications. As larger datasets are created every day, it becomes crucial for multi-threaded linear algebra libraries to utilize compute resources properly. Moving toward exascale computing, current programming models will not be able to fully take advantage of advances in memory hierarchies, computer architectures, and networks. Asynchronous Many-Task (AMT) runtime systems can help developers manage the available parallelism. In this dissertation we propose an adaptive solution to improve the performance of a linear algebra library based on a set of compile-time and runtime characteristics, including the machine architecture, the expression being evaluated, the number of cores the application runs on, the type of the operation, and the size of the matrices, in order to get as close as possible to the highest performance. Our focus is on machine learning applications, where we potentially deal with very large matrices that could make creating temporaries very expensive. For this purpose we selected the Blaze C++ library, a high-performance template-based math library that gives us the option to access the expression tree at compile time, along with HPX, a C++ standard library for concurrency and parallelism, as our runtime system. HPX, as an AMT runtime system, offers scalability and fine-grained parallelism by creating lightweight threads with fast context switching between them. Finding the optimum task granularity is a challenge in AMTs: creating too many small tasks results in performance degradation due to task scheduling overhead, while creating too few tasks leads to under-utilization of the resources. Our work focuses on finding the optimum task granularity for each specific problem.
We tried two different approaches to model the relationship between performance and grain size, in order to find a range of grain sizes that leads to the maximum performance. First, we used polynomial functions to model how throughput changes in terms of grain size and number of cores. Although this method was successful in finding the range of grain sizes for maximum throughput, it lacked a physical interpretation. This motivated us to go deeper and develop an analytical model of execution time in terms of grain size for balanced parallel for loops. Based on the analytical model, we propose a method to predict the range of grain sizes for minimum execution time. Moreover, since the parameters of the proposed model depend only on the system architecture, we suggest using a parallel for-loop benchmark to find these parameters on a system and then using them to find the range of grain sizes for minimum execution time for arbitrary balanced parallel for-loop applications run on the same machine. Using these models, we changed the current implementation of the HPX backend for Blaze by adding two parameters that represent the unit of work and the number of units included in each task, giving fine-grained control over the parallelism, which is possible through the HPX runtime system. Also, a complexity estimation function has been added to Blaze to estimate the number of floating-point operations occurring in each unit of work. The model parameters estimated through the parallel for-loop benchmark can also be plugged into Blaze at compile time, in order to find the optimum range of grain sizes at runtime based on the matrix sizes and the complexity of the operations. In the next step, we used the identified range of grain sizes to extend the previous implementation of splittable tasks, an algorithm to control task granularity.
We modified that implementation by scheduling tasks on idle cores directly, instead of waiting for them to be stolen, and by integrating the lower bound of the analytical model as the threshold at which to stop splitting, so that the threshold adapts to the system architecture and the application being executed.
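To illustrate the kind of analytical model described above, here is a minimal sketch of predicting the execution time of a balanced parallel for loop as a function of grain size, and of scanning for the grain-size range near the minimum. The functional form and the parameter values (`t_w` per-iteration work time, `t_o` per-task scheduling overhead) are illustrative assumptions, not the dissertation's actual model or measured constants:

```python
import math

def exec_time(g, n=1_000_000, p=16, t_w=1e-8, t_o=5e-7):
    """Estimated execution time of a balanced parallel for loop.

    g   : grain size (loop iterations per task)
    n   : total iterations
    p   : number of cores
    t_w : time per iteration of work (assumed constant: balanced loop)
    t_o : scheduling overhead per task
    """
    tasks = math.ceil(n / g)       # total tasks created
    rounds = math.ceil(tasks / p)  # sequential rounds of tasks per core
    return rounds * (g * t_w + t_o)

def grain_size_range(tolerance=0.05, **kw):
    """Grain sizes whose predicted time is within `tolerance` of the minimum."""
    candidates = [2 ** k for k in range(0, 21)]
    times = {g: exec_time(g, **kw) for g in candidates}
    best = min(times.values())
    good = [g for g, t in times.items() if t <= best * (1 + tolerance)]
    return min(good), max(good)

lo, hi = grain_size_range()
```

The tension the abstract describes is visible in the model: at small grain sizes the per-task overhead term dominates, while at very large grain sizes there are fewer tasks than cores and resources sit idle.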

    An Introduction to hpxMP: A Modern OpenMP Implementation Leveraging HPX, An Asynchronous Many-Task System

    Asynchronous Many-Task (AMT) runtime systems have gained increasing acceptance in the HPC community due to the performance improvements offered by fine-grained tasking runtime systems. At the same time, C++ standardization efforts are focused on creating higher-level interfaces able to replace OpenMP or OpenACC in modern C++ codes. These higher-level functions have been adopted in standards-conforming runtime systems such as HPX, giving users the ability to simply utilize fork-join parallelism in their own codes. Despite innovations in runtime systems and standardization efforts, users face enormous challenges porting legacy applications. Not only must users port their own codes, but often users rely on highly optimized libraries, such as BLAS and LAPACK, which use OpenMP for parallelization. Current efforts to create smooth migration paths have struggled with these challenges, especially as the threading systems of AMT libraries often compete with the threading system of OpenMP. To overcome these issues, our team has developed hpxMP, an implementation of the OpenMP standard which utilizes the underlying AMT system to schedule and manage tasks. This approach leverages the C++ interfaces exposed by HPX and allows users to execute their applications on an AMT system without changing their code. In this work, we compare hpxMP with Clang's OpenMP library using four linear algebra benchmarks from the Blaze C++ library. While hpxMP is often not able to reach the same performance, we demonstrate its viability as a smooth migration path for applications, though it will have to be extended to benefit from a more general task-based programming model.
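As a rough illustration of what an OpenMP implementation layered on a task system does, the sketch below builds a fork-join `parallel for` on top of a generic task executor. This is a simplified stand-in (a Python thread pool playing the role of HPX's lightweight-thread scheduler), not hpxMP's actual mechanism:

```python
from concurrent.futures import ThreadPoolExecutor

def parallel_for(fn, n, num_threads=4):
    """Fork-join worksharing loop in the spirit of `#pragma omp parallel for`,
    expressed as tasks submitted to an executor."""
    # Fork: split [0, n) into one contiguous chunk per thread (static schedule).
    chunk = (n + num_threads - 1) // num_threads
    ranges = [(i, min(i + chunk, n)) for i in range(0, n, chunk)]

    def run_chunk(bounds):
        start, stop = bounds
        for i in range(start, stop):
            fn(i)

    # Join: leaving the scoped executor waits for every chunk task,
    # mirroring the implicit barrier OpenMP places after a parallel for.
    with ThreadPoolExecutor(max_workers=num_threads) as pool:
        list(pool.map(run_chunk, ranges))

# Example: fill an array with squares in parallel chunks.
out = [0] * 100
parallel_for(lambda i: out.__setitem__(i, i * i), 100)
```

The point of the mapping is that the loop body becomes ordinary tasks, so the same scheduler can interleave them with any other tasks the runtime manages, instead of the OpenMP and AMT threading systems competing for cores.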

    Asynchronous Execution of Python Code on Task Based Runtime Systems

    Despite advancements in the areas of parallel and distributed computing, the complexity of programming on High Performance Computing (HPC) resources has deterred many domain experts, especially in the areas of machine learning and artificial intelligence (AI), from utilizing the performance benefits of such systems. Researchers and scientists favor high-productivity languages to avoid the inconvenience of programming in low-level languages and the cost of acquiring the necessary skills for programming at that level. In recent years Python, with the support of linear algebra libraries like NumPy, has gained popularity despite limitations that prevent such code from running in distributed settings. Here we present a solution which maintains both high-level programming abstractions and parallel and distributed efficiency. Phylanx is an asynchronous array processing toolkit which transforms Python and NumPy operations into code that can be executed in parallel on HPC resources, by mapping Python and NumPy functions and variables into a dependency tree executed by HPX, a general-purpose, parallel, task-based runtime system written in C++. Phylanx additionally provides introspection and visualization capabilities for debugging and performance analysis. We have tested the foundations of our approach by comparing our implementation of widely used machine learning algorithms to accepted NumPy standards.
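To illustrate the dependency-tree idea in a drastically simplified form (this is not Phylanx's actual API), the following sketch records NumPy-style expressions as a tree of operation nodes and only executes the tree on demand:

```python
class Node:
    """One node of a dependency tree over array operations: a toy model of
    how a frontend can defer NumPy-style expressions for later execution."""
    def __init__(self, op, *children, value=None):
        self.op, self.children, self.value = op, children, value

    def __add__(self, other):
        return Node("add", self, other)

    def __matmul__(self, other):
        return Node("dot", self, other)

    def evaluate(self):
        # A real runtime (HPX under Phylanx) could evaluate independent
        # subtrees as parallel tasks; here we simply recurse sequentially.
        if self.op == "const":
            return self.value
        a, b = (c.evaluate() for c in self.children)
        if self.op == "add":   # elementwise matrix addition
            return [[x + y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]
        if self.op == "dot":   # matrix multiplication
            return [[sum(x * y for x, y in zip(row, col))
                     for col in zip(*b)] for row in a]

def const(m):
    return Node("const", value=m)

# (A @ B) + C is first recorded as an expression tree, then executed.
A = const([[1, 2], [3, 4]])
B = const([[5, 6], [7, 8]])
C = const([[1, 1], [1, 1]])
result = ((A @ B) + C).evaluate()
```

Because the whole expression is available as a tree before anything runs, a runtime can schedule independent subtrees concurrently and distribute them, which plain eager NumPy evaluation cannot do.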

    Endovascular Repair of Supra-Celiac and Abdominal Aortic Pseudo Aneurysms Concomitant with a Right Atrial Mass in a Patient with Behçet’s Disease: A Case Report

    Behçet's disease is a rare immune-mediated systemic vasculitis which, besides its more frequent involvement of the eyes and skin, sometimes presents with aortic pseudoaneurysms and, more rarely, cardiac inflammatory masses. A 51-year-old patient with Behçet's disease presented with two symptomatic aortic pseudoaneurysms concomitant with a right atrial mass. Computed tomography (CT) revealed one supra-celiac and one infrarenal aortic pseudoaneurysm. Echocardiography showed a large mobile mass in the right atrium. Both pseudoaneurysms were successfully excluded simultaneously via an endovascular approach with Zenith stent-grafts, and the atrial mass was surgically removed 10 days later. Post-implant CT showed successful exclusion of both pseudoaneurysms and patency of all relevant arteries; the patient is now asymptomatic and has returned to a normal lifestyle. Multiple pseudoaneurysms concomitant with a right atrial mass can be an initial manifestation of Behçet's disease. Endovascular repair can be a good treatment option for the pseudoaneurysms.